Voice Quality Control Using Perceptual Expressions for Statistical Parametric Speech Synthesis Based on Cluster Adaptive Training
نویسندگان
چکیده
This paper describes novel voice quality control of synthetic speech using cluster adaptive training (CAT). In this method, we model voice quality factors labeled with perceptual expressions such as “Gender,” “Age” and “Brightness.” In advance, we obtain the intensity scores of the perceptual expressions by conducting a listening test, which evaluates differences of voice qualities between synthetic speech of average voice and that of the target. Then we build perceptual expression (PE) clusters that we call PE models (PEM) under the conditions that the average voice model is used as the bias cluster and the PE intensity scores are employed as the CAT weights. In synthesis, we can generate controlled synthetic speech by the linear combination of PEMs and the existing speaker’s model. Subjective results demonstrate that the proposed method can control the voice qualities with PEs in many cases and the target synthetic speech modified by PEMs achieves comparatively good speech quality.
منابع مشابه
Study on Unit-Selection and Statistical Parametric Speech Synthesis Techniques
One of the interesting topics on multimedia domain is concerned with empowering computer in order to speech production. Speech synthesis is granting human abilities to the computer for speech production. Data-based approach and process-based approach are the two main approaches on speech synthesis. Each approach has its varied challenges. Unit-selection speech synthesis and statistical parametr...
متن کاملDeep neural network-based statistical parametric speech synthesis system using improved time-frequency trajectory excitation model
This paper proposes a deep neural network (DNN)-based statistical parametric speech synthesis system using an improved time-frequency trajectory excitation (ITFTE) model. The ITFTE model, which efficiently reduces the parametric redundancy of a TFTE model, improved the perceptual quality of the vocoding process and the estimation accuracy of the training process. However, there remain problems ...
متن کاملSpeaker Representations for Speaker Adaptation in Multiple Speakers' BLSTM-RNN-Based Speech Synthesis
Training a high quality acoustic model with a limited database and synthesizing a new speaker’s voice with a few utterances have been hot topics in deep neural network (DNN) based statistical parametric speech synthesis (SPSS). To solve these problems, we built a unified framework for speaker adaptive training as well as speaker adaptation on Bidirectional Long ShortTerm Memory with Recurrent N...
متن کاملSpeaker and language adaptive training for HMM-based polyglot speech synthesis
This paper proposes a novel technique for speaker and language adaptive training for HMM-based statistical parametric polyglot speech synthesis. Language-specific context-dependencies in the system are captured using CAT with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by CMLLR-based transforms. This framework allows multi-speaker/multi-la...
متن کاملOutput-Based Objective Measure for Non-Intrusive Speech Quality Evaluation
This paper describes a newly developed output-based method for non-intrusive evaluation of speech quality of voice communication systems, and evaluates its performance. The method, which uses only the output of the system, is based on measuring perceptually motivated objective auditory distances between the voiced parts of the speech signal whose quality to be evaluated to appropriately matchin...
متن کامل